A Rich Feature Vector for Protein-Protein Interaction Extraction from Multiple Corpora
Abstract
Because protein-protein interaction (PPI) extraction from text is important, many corpora have been proposed, each with slightly different definitions of proteins and PPIs. Since no single corpus is large enough to saturate a machine learning system, it is necessary to learn from multiple corpora. In this paper, we propose a solution to this challenge. We designed a rich feature vector and applied a support vector machine modified for corpus weighting (SVM-CW) to the task of PPI extraction from multiple corpora. The rich feature vector, built from multiple useful kernels, expresses the information important for PPI extraction; the system using it was shown to be both faster and more accurate than the original kernel-based system, even when using just a single corpus. SVM-CW learns from one corpus while using the other corpora for support. SVM-CW is simple, yet it is more effective than other methods that have previously been applied successfully to other NLP tasks. With the feature vector and SVM-CW, our system achieved the best performance among all state-of-the-art PPI extraction systems reported so far.
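The abstract does not spell out the SVM-CW formulation, so the following is only a minimal sketch of one plausible reading: a linear SVM in which every training example carries a misclassification cost set by the corpus it comes from. The function names, the toy two-feature data, and the cost values are all assumptions made for illustration; this is not the authors' implementation, and plain subgradient descent stands in for whatever solver they used.

```python
def train_weighted_linear_svm(examples, labels, costs, epochs=300, lr=0.05, reg=0.01):
    """Full-batch subgradient descent on
    reg/2 * ||w||^2 + sum_i costs[i] * max(0, 1 - y_i * (w . x_i)).
    No bias term is used (an assumption that keeps the sketch short)."""
    dim = len(examples[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [reg * wj for wj in w]  # gradient of the L2 regulariser
        for x, y, c in zip(examples, labels, costs):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1.0:  # hinge loss is active for this example
                for j in range(dim):
                    grad[j] -= c * y * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, x):
    """Return +1 (interaction) or -1 (no interaction)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Hypothetical toy data: the two corpora agree on feature 1 but annotate
# feature 2 with opposite labels, mimicking differing PPI definitions.
target_x = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
target_y = [1, -1, 1, -1]
source_x = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
source_y = [1, -1, -1, 1]

all_x = target_x + source_x
all_y = target_y + source_y


def corpus_costs(target_cost, source_cost):
    """One cost per example, determined only by the example's corpus."""
    return [target_cost] * len(target_x) + [source_cost] * len(source_x)


# Target corpus dominates; the source corpus only lends support.
w_target_led = train_weighted_linear_svm(all_x, all_y, corpus_costs(1.0, 0.2))
# Reversed costs, for contrast.
w_source_led = train_weighted_linear_svm(all_x, all_y, corpus_costs(0.2, 1.0))
```

In this toy setting both models agree on the feature the corpora annotate consistently, while the disputed feature follows whichever corpus carries the higher cost. That is the behaviour one would want from "learning from one corpus while using other corpora for support".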
Similar resources
Distributed smoothed tree kernel for protein-protein interaction extraction from the biomedical literature
Automatic extraction of protein-protein interaction (PPI) pairs from biomedical literature is a widely examined task in biological information extraction. Currently, many kernel-based approaches, such as linear kernels, tree kernels, graph kernels, and combinations of multiple kernels, have achieved promising results in the PPI task. However, most of these kernel methods fail to capture the semantic relati...
Exploiting Grammatical Relations for Protein Relation Extraction and Role Labeling
Automatic protein interaction mining from natural language texts and automatic identification of the agent and target proteins (i.e. role labeling) are challenging problems that attract a lot of attention because of the growing amount of biomedical text resources. We propose a novel approach that relies exclusively on parsing and dependency information. We strategically omit any context informa...
DEEPER: A Full Parsing Based Approach to Protein Relation Extraction
Lexical variance in biomedical texts poses a challenge to automatic protein relation mining. We therefore propose a new approach that relies only on more general language structures such as parsing and dependency information for the construction of feature vectors that can be used by standard machine learning algorithms in deciding whether a sentence describes a protein interaction or not. As o...
Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks
Background: Predicting protein localization is among the most important problems in bioinformatics; it is used to predict which proteins occur in cells and in organelles such as mitochondria. In this study, several machine learning algorithms are applied to predict intracellular protein locations. These algorithms use the features extracted from pro...
Expression of Recombinant Serine-Rich Entamoeba histolytica Protein
Background: Entamoeba histolytica antigenic markers such as the serine-rich E. histolytica protein (SREHP) have recently been used for vaccine preparation, for studies of genetic diversity among Entamoeba histolytica isolates, and for differentiating between the E. histolytica and E. dispar species. This study was carried out with the aim of expression of a recombinant Serine Rich E. histolytica prot...